AUC / ROC AUC (Area Under the ROC Curve)#
AUC-ROC is a ranking metric for binary classification. It answers a simple question:
If we randomly pick one positive and one negative example, what’s the probability the model assigns a higher score to the positive one?
This notebook focuses on ROC AUC (the most common meaning of “AUC” in classification).
Goals:

- Build the ROC curve from the confusion matrix by sweeping a threshold
- Compute AUC via the trapezoidal rule (what `sklearn.metrics.auc` does)
- Implement `roc_curve` + `roc_auc_score` from scratch in NumPy
- Visualize why AUC is threshold-free and how it relates to ranking
- Use an AUC-inspired surrogate loss to train a simple linear model
Quick import (scikit-learn)
from sklearn.metrics import roc_auc_score, roc_curve, auc
Assumption throughout: binary labels y ∈ {0,1} and a real-valued score s(x) where larger means “more positive”.
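As a quick sanity check on these imports, here is a tiny hand-made example (the labels and scores below are illustrative, not from any dataset):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Tiny illustrative example: 3 negatives, 3 positives
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# roc_auc_score takes (labels, scores) directly
print(roc_auc_score(y_true, y_score))

# Equivalent route: build the ROC curve, then integrate it
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))
```

Both routes agree: 8 of the 9 positive-negative pairs are ranked correctly here, so the AUC is 8/9 ≈ 0.889.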
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Intuition: AUC is about ordering, not thresholds#
Suppose your model outputs a score \(s(x)\) (a probability, logit, margin, etc.).
If we pick a threshold \(\tau\), we turn scores into hard predictions.
If we sweep \(\tau\) from \(+\infty\) down to \(-\infty\), we trace out the ROC curve.
AUC summarizes the entire ROC curve into one number in \([0,1]\).
Two key takeaways:

1. **Only ranking matters:** any strictly increasing transform \(g\) preserves AUC:

\[\mathrm{AUC}(s) = \mathrm{AUC}(g\circ s)\]

2. **Pairwise probability interpretation:**

\[ \mathrm{AUC} = \mathbb{P}(s(X^+) > s(X^-)) + \tfrac{1}{2}\,\mathbb{P}(s(X^+) = s(X^-)) \]

where \(X^+\) is a random positive example and \(X^-\) is a random negative example.
# A toy "score" example: overlapping score distributions
n_pos, n_neg = 400, 600
pos_scores = rng.normal(loc=1.0, scale=1.0, size=n_pos)
neg_scores = rng.normal(loc=0.0, scale=1.0, size=n_neg)
y_true = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
y_score = np.r_[pos_scores, neg_scores]
toy_auc = roc_auc_score(y_true, y_score)
print(f"Toy ROC AUC = {toy_auc:.3f}")
fig = px.histogram(
{"score": y_score, "class": np.where(y_true == 1, "positive", "negative")},
x="score",
color="class",
opacity=0.6,
barmode="overlay",
marginal="box",
title="Toy scores: overlap between classes",
labels={"score": "model score s(x)", "class": "true class"},
)
fig.show()
Toy ROC AUC = 0.772
2) From thresholds to the ROC curve#
Given a threshold \(\tau\), convert scores to predicted labels:

\[\hat{y}_\tau(x) = \mathbb{1}[s(x) \ge \tau]\]

Confusion-matrix counts (as functions of \(\tau\)):

\[\mathrm{TP}(\tau) = \#\{i : y_i = 1,\ s_i \ge \tau\}, \qquad \mathrm{FP}(\tau) = \#\{i : y_i = 0,\ s_i \ge \tau\}\]

and similarly \(\mathrm{TN}(\tau)\), \(\mathrm{FN}(\tau)\). Normalize to rates:

\[\mathrm{TPR}(\tau) = \frac{\mathrm{TP}(\tau)}{P}, \qquad \mathrm{FPR}(\tau) = \frac{\mathrm{FP}(\tau)}{N}\]

where \(P\) is the number of positives and \(N\) the number of negatives.

The ROC curve is the set of points:

\[\{(\mathrm{FPR}(\tau),\ \mathrm{TPR}(\tau)) : \tau \in \mathbb{R}\}\]
def confusion_counts_at_threshold(y_true, y_score, threshold):
y_true = np.asarray(y_true).astype(int)
y_score = np.asarray(y_score)
y_pred = (y_score >= threshold).astype(int)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
return tp, fp, tn, fn
for thr in [2.0, 1.0, 0.0, -1.0]:
tp, fp, tn, fn = confusion_counts_at_threshold(y_true, y_score, threshold=thr)
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(f"thr={thr:>4}: TP={tp:>3} FP={fp:>3} TN={tn:>3} FN={fn:>3} | TPR={tpr:.3f} FPR={fpr:.3f}")
thr= 2.0: TP= 53 FP= 11 TN=589 FN=347 | TPR=0.133 FPR=0.018
thr= 1.0: TP=203 FP= 88 TN=512 FN=197 | TPR=0.507 FPR=0.147
thr= 0.0: TP=338 FP=299 TN=301 FN= 62 | TPR=0.845 FPR=0.498
thr=-1.0: TP=392 FP=491 TN=109 FN= 8 | TPR=0.980 FPR=0.818
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=fpr,
y=tpr,
mode="lines",
name=f"ROC (AUC = {roc_auc:.3f})",
line=dict(width=3),
)
)
fig.add_trace(
go.Scatter(
x=[0, 1],
y=[0, 1],
mode="lines",
name="Random (AUC = 0.5)",
line=dict(dash="dash", color="gray"),
)
)
# Color ROC points by threshold (skip the first threshold = +inf)
fig.add_trace(
go.Scatter(
x=fpr[1:],
y=tpr[1:],
mode="markers",
name="Threshold points",
marker=dict(
size=6,
color=thresholds[1:],
colorscale="Viridis",
showscale=True,
colorbar=dict(title="threshold"),
),
)
)
fig.update_layout(
title="ROC curve (TPR vs FPR) from a threshold sweep",
xaxis_title="False Positive Rate (FPR)",
yaxis_title="True Positive Rate (TPR)",
width=750,
height=520,
)
# Make the plot square-ish so the diagonal really looks like 45 degrees
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
3) AUC = area under the ROC curve (trapezoidal rule)#
Once you have ROC points \((x_k, y_k)\), \(k = 1, \dots, K\), with \(x_k\) = FPR and \(y_k\) = TPR sorted by increasing FPR, you can approximate the integral using trapezoids:

\[\mathrm{AUC} \approx \sum_{k=2}^{K} (x_k - x_{k-1})\,\frac{y_k + y_{k-1}}{2}\]

This is exactly what `sklearn.metrics.auc(x, y)` computes (for any curve, not only ROC).
def auc_trapezoid(x, y):
'''Compute area under a curve via the trapezoidal rule.
Parameters
----------
x, y : 1D arrays of same length
x must be sorted in non-decreasing order.
'''
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
if x.ndim != 1 or y.ndim != 1 or x.shape != y.shape:
raise ValueError("x and y must be 1D arrays with the same shape")
dx = np.diff(x)
if np.any(dx < 0):
raise ValueError("x must be sorted in non-decreasing order")
return float(np.sum(dx * (y[1:] + y[:-1]) / 2.0))
print(f"sklearn auc(fpr,tpr) = {auc(fpr, tpr):.10f}")
print(f"NumPy auc_trapezoid(fpr,tpr) = {auc_trapezoid(fpr, tpr):.10f}")
sklearn auc(fpr,tpr) = 0.7715416667
NumPy auc_trapezoid(fpr,tpr) = 0.7715416667
4) From scratch: ROC curve and ROC AUC in NumPy#
A convenient way to build the ROC curve:
Sort examples by score (descending).
Sweep a threshold from high to low (equivalently: move down the sorted list).
Track how many positives/negatives we’ve included so far.
Record \((\mathrm{FPR},\mathrm{TPR})\) whenever the score changes.
This produces the same step-shaped ROC curve you get from scikit-learn.
def roc_curve_numpy(y_true, y_score):
'''Compute ROC curve (FPR, TPR, thresholds) for binary classification.
Matches the standard approach used by scikit-learn:
- thresholds are unique score values in descending order, with an initial +inf
- the returned curve starts at (0,0)
Parameters
----------
y_true : array-like of shape (n,)
Binary labels in {0,1}
y_score : array-like of shape (n,)
Scores where larger means more positive
'''
y_true = np.asarray(y_true)
y_score = np.asarray(y_score)
if y_true.shape != y_score.shape:
raise ValueError("y_true and y_score must have the same shape")
unique = np.unique(y_true)
if not set(unique.tolist()).issubset({0, 1}):
raise ValueError("y_true must be binary with values in {0,1}")
y_true = y_true.astype(int)
n_pos = int(np.sum(y_true == 1))
n_neg = int(np.sum(y_true == 0))
if n_pos == 0 or n_neg == 0:
raise ValueError("ROC is undefined when only one class is present")
# Sort by score descending (stable sort so ties are handled consistently)
order = np.argsort(-y_score, kind="mergesort")
y_true_sorted = y_true[order]
y_score_sorted = y_score[order]
# Indices where score changes (each threshold is a distinct score value)
distinct = np.where(y_score_sorted[1:] != y_score_sorted[:-1])[0]
threshold_idxs = np.r_[distinct, y_true_sorted.size - 1]
tps = np.cumsum(y_true_sorted)[threshold_idxs]
fps = (threshold_idxs + 1) - tps
tpr = tps / n_pos
fpr = fps / n_neg
thresholds = y_score_sorted[threshold_idxs]
# Prepend the (0,0) point with threshold = +inf
tpr = np.r_[0.0, tpr]
fpr = np.r_[0.0, fpr]
thresholds = np.r_[np.inf, thresholds]
return fpr, tpr, thresholds
def roc_auc_score_numpy(y_true, y_score):
fpr, tpr, _ = roc_curve_numpy(y_true, y_score)
return auc_trapezoid(fpr, tpr)
fpr_np, tpr_np, thr_np = roc_curve_numpy(y_true, y_score)
auc_np = roc_auc_score_numpy(y_true, y_score)
print(f"sklearn roc_auc_score = {roc_auc_score(y_true, y_score):.10f}")
print(f"NumPy roc_auc_score = {auc_np:.10f}")
# Sanity check: same AUC
assert np.isclose(auc_np, roc_auc_score(y_true, y_score))
sklearn roc_auc_score = 0.7715416667
NumPy roc_auc_score = 0.7715416667
5) AUC as a rank statistic (Mann–Whitney / Wilcoxon)#
The “pairwise probability” view can be written explicitly as:

\[\mathrm{AUC} = \frac{1}{PN} \sum_{i:\,y_i=1}\ \sum_{j:\,y_j=0} \left( \mathbb{1}[s_i > s_j] + \tfrac{1}{2}\,\mathbb{1}[s_i = s_j] \right)\]

There’s an equivalent computation using ranks. Let \(r_i\) be the rank of \(s_i\) among all scores (rank 1 = smallest score), using average ranks for ties. Then:

\[\mathrm{AUC} = \frac{\sum_{i:\,y_i=1} r_i \;-\; \tfrac{P(P+1)}{2}}{PN}\]

This is useful because it can be computed in \(O(n\log n)\) via sorting, instead of the \(O(PN)\) double sum.
def rankdata_average_ties(x):
'''Rank data with average ranks for ties (1 = smallest).'''
x = np.asarray(x)
order = np.argsort(x, kind="mergesort")
x_sorted = x[order]
ranks_sorted = np.empty(x_sorted.shape[0], dtype=float)
i = 0
n = x_sorted.shape[0]
while i < n:
j = i
while j + 1 < n and x_sorted[j + 1] == x_sorted[i]:
j += 1
# ranks are 1..n, positions i..j correspond to ranks (i+1)..(j+1)
avg_rank = 0.5 * ((i + 1) + (j + 1))
ranks_sorted[i : j + 1] = avg_rank
i = j + 1
ranks = np.empty_like(ranks_sorted)
ranks[order] = ranks_sorted
return ranks
def roc_auc_score_rank_numpy(y_true, y_score):
y_true = np.asarray(y_true).astype(int)
y_score = np.asarray(y_score)
n_pos = int(np.sum(y_true == 1))
n_neg = int(np.sum(y_true == 0))
if n_pos == 0 or n_neg == 0:
raise ValueError("ROC AUC is undefined when only one class is present")
ranks = rankdata_average_ties(y_score)
sum_pos_ranks = float(np.sum(ranks[y_true == 1]))
return (sum_pos_ranks - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
auc_rank = roc_auc_score_rank_numpy(y_true, y_score)
print(f"ROC AUC (rank formula) = {auc_rank:.10f}")
print(f"ROC AUC (ROC+trapz) = {roc_auc_score_numpy(y_true, y_score):.10f}")
assert np.isclose(auc_rank, roc_auc_score_numpy(y_true, y_score))
# Direct pairwise computation: O(P*N), only feasible for small n
pos_idx = np.flatnonzero(y_true == 1)
neg_idx = np.flatnonzero(y_true == 0)
sub_idx = np.r_[
rng.choice(pos_idx, size=60, replace=False),
rng.choice(neg_idx, size=60, replace=False),
]
sub_idx = rng.permutation(sub_idx)
y_sub = y_true[sub_idx]
s_sub = y_score[sub_idx]
pos = s_sub[y_sub == 1]
neg = s_sub[y_sub == 0]
diffs = pos[:, None] - neg[None, :]
auc_pairwise = (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / (pos.size * neg.size)
print(f"ROC AUC (pairwise, small) = {auc_pairwise:.10f}")
assert np.isclose(auc_pairwise, roc_auc_score_rank_numpy(y_sub, s_sub))
ROC AUC (rank formula) = 0.7715416667
ROC AUC (ROC+trapz) = 0.7715416667
ROC AUC (pairwise, small) = 0.7977777778
Monotonic transforms don’t change AUC#
Because AUC depends only on the ordering of scores, any strictly increasing transform keeps the ROC curve (and AUC) the same.
def softplus(z):
# stable log(1 + exp(z))
return np.logaddexp(0.0, z)
s = y_score
s_transformed = softplus(3.0 * s) # strictly increasing
print(f"AUC(s) = {roc_auc_score_numpy(y_true, s):.10f}")
print(f"AUC(softplus(3s)) = {roc_auc_score_numpy(y_true, s_transformed):.10f}")
AUC(s) = 0.7715416667
AUC(softplus(3s)) = 0.7715416667
6) A threshold view: how TPR and FPR change as you move the cutoff#
ROC plots TPR vs FPR, but sometimes it’s useful to see both rates as explicit functions of the threshold \(\tau\).
# Use the NumPy ROC implementation so we have (fpr,tpr,threshold) aligned
fpr_np, tpr_np, thr_np = roc_curve_numpy(y_true, y_score)
# Drop the first threshold (+inf) for plotting
thr_plot = thr_np[1:]
tpr_plot = tpr_np[1:]
fpr_plot = fpr_np[1:]
fig = go.Figure()
fig.add_trace(go.Scatter(x=thr_plot, y=tpr_plot, mode="lines+markers", name="TPR"))
fig.add_trace(go.Scatter(x=thr_plot, y=fpr_plot, mode="lines+markers", name="FPR"))
fig.update_layout(
title="TPR and FPR as functions of the threshold",
xaxis_title="threshold τ (higher = stricter)",
yaxis_title="rate",
width=800,
height=450,
)
# thresholds are in descending score order; flip the x-axis so we read left→right as τ decreases
fig.update_xaxes(autorange="reversed")
fig.show()
7) Practical usage: ROC AUC for a classifier#
In practice, you usually:
1. Train a model with a differentiable loss (log-loss, hinge, etc.).
2. Evaluate ROC AUC on validation/test using scores (`predict_proba` or `decision_function`).
Important: AUC expects scores, not hard class labels.
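To see why this matters, here is a small synthetic sketch (data and seed are arbitrary). Passing hard 0/1 predictions doesn’t raise an error: they are treated as a two-valued score, which collapses the ROC curve to a single interior point, and the resulting “AUC” is just the balanced accuracy \((\mathrm{TPR}+\mathrm{TNR})/2\) at that one threshold:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = np.r_[np.ones(100, dtype=int), np.zeros(100, dtype=int)]
scores = np.r_[rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 100)]
labels = (scores >= 0.5).astype(int)  # hard predictions at one threshold

auc_scores = roc_auc_score(y, scores)  # uses the full ranking
auc_labels = roc_auc_score(y, labels)  # two-valued "score": one ROC point

print(f"AUC from scores: {auc_scores:.3f}")
print(f"AUC from labels: {auc_labels:.3f}")
```

The labels-based number throws away all the ranking information between the two predicted classes, so always pass `predict_proba[:, 1]` or `decision_function` output, never `predict` output.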
X, y = make_classification(
n_samples=2500,
n_features=10,
n_informative=4,
n_redundant=0,
weights=[0.85, 0.15],
class_sep=1.0,
random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, stratify=y, random_state=42
)
clf = make_pipeline(
StandardScaler(),
LogisticRegression(max_iter=2000),
)
clf.fit(X_train, y_train)
# Probability scores for the positive class
y_score_test = clf.predict_proba(X_test)[:, 1]
auc_test = roc_auc_score(y_test, y_score_test)
print(f"Test ROC AUC = {auc_test:.3f}")
fpr_test, tpr_test, _ = roc_curve(y_test, y_score_test)
fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr_test, y=tpr_test, mode="lines", name=f"Test ROC (AUC={auc_test:.3f})"))
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode="lines", line=dict(dash="dash", color="gray"), name="Random"))
fig.update_layout(
title="ROC curve on a held-out test set",
xaxis_title="False Positive Rate (FPR)",
yaxis_title="True Positive Rate (TPR)",
width=750,
height=520,
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()
# Same AUC via our NumPy implementation
print(f"NumPy ROC AUC = {roc_auc_score_numpy(y_test, y_score_test):.10f}")
Test ROC AUC = 0.733
NumPy ROC AUC = 0.7329897748
8) Using AUC for optimization (via a smooth surrogate)#
The “true” AUC objective is based on an indicator:

\[\mathrm{AUC}(w) = \frac{1}{PN} \sum_{i:\,y_i=1}\ \sum_{j:\,y_j=0} \mathbb{1}[s_w(x_i) > s_w(x_j)]\]

This is hard to optimize directly because the indicator is non-differentiable and the double sum over all pairs is \(O(PN)\).

A common trick: replace the indicator with a pairwise surrogate loss. For a linear scoring model \(s_w(x)=w^\top x\), define \(d_{ij}=w^\top(x_i-x_j)\) for each positive \(i\) and negative \(j\), and use the pairwise logistic loss:

\[\ell(d_{ij}) = \log\left(1 + e^{-d_{ij}}\right)\]

Then minimize:

\[L(w) = \frac{1}{PN} \sum_{i:\,y_i=1}\ \sum_{j:\,y_j=0} \ell(d_{ij}) \;+\; \frac{\lambda}{2}\,\|w\|^2\]

The gradient of one pair is:

\[\nabla_w\,\ell(d_{ij}) = -\sigma(-d_{ij})\,(x_i - x_j)\]

where \(\sigma(z)=\frac{1}{1+e^{-z}}\) is the sigmoid.
In code, we approximate the full double sum with mini-batches of random positive-negative pairs.
def sigmoid(z):
z = np.asarray(z, dtype=float)
out = np.empty_like(z)
pos = z >= 0
out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
exp_z = np.exp(z[~pos])
out[~pos] = exp_z / (1.0 + exp_z)
return out
def standardize_fit(X):
mean = X.mean(axis=0)
std = X.std(axis=0)
std = np.where(std == 0, 1.0, std)
return mean, std
def standardize_transform(X, mean, std):
return (X - mean) / std
def add_intercept(X):
return np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)
# Prepare a smaller dataset for fast training curves
X_small, y_small = make_classification(
n_samples=1400,
n_features=8,
n_informative=3,
n_redundant=0,
weights=[0.9, 0.1],
class_sep=0.9,
flip_y=0.03,
random_state=7,
)
X_tr, X_va, y_tr, y_va = train_test_split(
X_small, y_small, test_size=0.35, stratify=y_small, random_state=7
)
mean, std = standardize_fit(X_tr)
X_tr = standardize_transform(X_tr, mean, std)
X_va = standardize_transform(X_va, mean, std)
X_tr = add_intercept(X_tr)
X_va = add_intercept(X_va)
def train_logistic_ce(X, y, X_val, y_val, lr=0.1, reg=1e-3, epochs=30, batch_size=256, seed=0):
rng_local = np.random.default_rng(seed)
n, d = X.shape
w = np.zeros(d)
auc_hist = []
for _ in range(epochs):
idx = rng_local.permutation(n)
for start in range(0, n, batch_size):
batch = idx[start : start + batch_size]
xb = X[batch]
yb = y[batch]
logits = xb @ w
p = sigmoid(logits)
grad = (xb.T @ (p - yb)) / xb.shape[0] + reg * w
w -= lr * grad
auc_hist.append(roc_auc_score_numpy(y_val, X_val @ w))
return w, np.array(auc_hist)
def train_auc_pairwise(
X,
y,
X_val,
y_val,
lr=0.1,
reg=1e-3,
epochs=30,
steps_per_epoch=200,
batch_pairs=512,
seed=0,
):
rng_local = np.random.default_rng(seed)
n, d = X.shape
w = np.zeros(d)
pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
if pos_idx.size == 0 or neg_idx.size == 0:
raise ValueError("Need both classes for AUC training")
auc_hist = []
for _ in range(epochs):
for _ in range(steps_per_epoch):
pi = rng_local.choice(pos_idx, size=batch_pairs, replace=True)
ni = rng_local.choice(neg_idx, size=batch_pairs, replace=True)
diff = X[pi] - X[ni]
d_scores = diff @ w
# loss per pair: log(1 + exp(-d)) = softplus(-d)
# grad: -sigmoid(-d) * diff
weight = sigmoid(-d_scores)
grad = -(weight[:, None] * diff).mean(axis=0) + reg * w
w -= lr * grad
auc_hist.append(roc_auc_score_numpy(y_val, X_val @ w))
return w, np.array(auc_hist)
w_ce, auc_ce = train_logistic_ce(
X_tr,
y_tr,
X_va,
y_va,
lr=0.2,
reg=1e-3,
epochs=30,
batch_size=256,
seed=1,
)
w_auc, auc_auc = train_auc_pairwise(
X_tr,
y_tr,
X_va,
y_va,
lr=0.2,
reg=1e-3,
epochs=30,
steps_per_epoch=150,
batch_pairs=512,
seed=1,
)
print(f"Final val AUC (cross-entropy training) = {auc_ce[-1]:.3f}")
print(f"Final val AUC (pairwise AUC training) = {auc_auc[-1]:.3f}")
Final val AUC (cross-entropy training) = 0.935
Final val AUC (pairwise AUC training) = 0.945
fig = go.Figure()
fig.add_trace(go.Scatter(y=auc_ce, mode="lines+markers", name="Train w/ log-loss (CE)"))
fig.add_trace(go.Scatter(y=auc_auc, mode="lines+markers", name="Train w/ pairwise AUC surrogate"))
fig.update_layout(
title="Validation ROC AUC during training",
xaxis_title="epoch",
yaxis_title="ROC AUC",
width=850,
height=450,
yaxis=dict(range=[0.45, 1.0]),
)
fig.show()
9) Pros, cons, and when to use ROC AUC#
Pros
Threshold-free: summarizes performance across all possible thresholds.
Ranking interpretation: \(\mathrm{AUC}\approx \mathbb{P}(s(X^+) > s(X^-))\) is intuitive.
Insensitive to score calibration: if your ranking is good but probabilities aren’t calibrated, AUC can still be high.
Works well for model comparison when you don’t yet know the operating threshold.
Cons / pitfalls
Not about calibration: a model can have great AUC and terrible probability estimates.
May be misleading under heavy class imbalance if you care about the precision regime (consider PR AUC).
Averages over regions you may not care about (e.g., you might only tolerate FPR < 1%).
Hard to optimize directly: the exact AUC objective is non-smooth and pairwise.
Depends on label meaning and score direction: flipping the positive class changes AUC to \(1-\mathrm{AUC}\).
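Two of these pitfalls are easy to demonstrate on synthetic scores (seed and distributions arbitrary). Flipping which class counts as positive mirrors the ROC curve, so the AUC becomes its complement; and if only a low-FPR regime matters, `roc_auc_score` accepts a `max_fpr` argument that computes a standardized partial AUC (McClish correction) over \([0, \mathrm{max\_fpr}]\):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = np.r_[np.ones(200, dtype=int), np.zeros(200, dtype=int)]
s = np.r_[rng.normal(1.0, 1.0, 200), rng.normal(0.0, 1.0, 200)]

auc_full = roc_auc_score(y, s)
auc_flipped = roc_auc_score(1 - y, s)  # swap which class is "positive"
print(f"AUC = {auc_full:.4f}, flipped = {auc_flipped:.4f} (they sum to 1)")

# Partial AUC over FPR in [0, 0.1], rescaled so 0.5 still means "random"
auc_low_fpr = roc_auc_score(y, s, max_fpr=0.1)
print(f"Standardized partial AUC (max_fpr=0.1) = {auc_low_fpr:.4f}")
```

The standardized partial AUC can rank two models very differently from the full AUC when their ROC curves cross in the low-FPR region.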
Good use cases
When you care about ranking (who is more likely positive) more than a single fixed decision threshold.
When you’ll choose the operating threshold later (business constraints, costs, etc.).
As a model-selection metric for classifiers that output usable scores.
10) Exercises#
Create a highly imbalanced dataset (e.g. 1% positives) and compare ROC AUC vs PR AUC.
Show two models with the same ROC AUC but very different TPR when FPR < 0.01.
Extend the NumPy ROC code to support sample weights.
For the pairwise AUC training, try a hinge loss \(\ell(d)=\max(0, 1-d)\) and compare curves.
References#
scikit-learn: `roc_curve`, `roc_auc_score`, `auc`
https://scikit-learn.org/stable/modules/model_evaluation.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
Tom Fawcett (2006): An introduction to ROC analysis
Wilcoxon / Mann–Whitney rank-sum statistic interpretation of AUC